Abstract A selective sweep describes the reduction of linked genetic variation
due to strong positive selection. If $s$ is the fitness advantage of a homozygote
for the beneficial allele and $h$ its dominance coefficient, it is usually assumed
that $h = 1/2$, i.e. the beneficial allele is co-dominant. We complement existing
theory for selective sweeps by assuming that $h$ is any value in $[0, 1]$. We show
that genetic diversity patterns under selective sweeps with strength $s$ and
dominance $0 < h < 1$ are similar to co-dominant sweeps with selection strength
$2hs$. Moreover, we focus on the case $h = 0$ of a completely recessive beneficial
allele. We find that the length of the sweep, i.e. the time from occurrence until
fixation of the beneficial allele, is of the order of $\sqrt{N/s}$ generations, if $N$
is the population size. Simulations as well as our results show that genetic
diversity patterns in the recessive case $h = 0$ greatly differ from all other cases.

Keywords Genetic hitchhiking • selective sweep • beneficial mutation •
recessive allele • genealogy
1 Introduction

The model of selective sweeps (also called genetic hitchhiking) predicts a
reduction in sequence diversity at a neutral locus closely linked to a beneficial
allele (Maynard Smith and Haigh 1974). Most analyses of this model assume
that the beneficial allele is co-dominant. Accordingly, genome scans for
evidence of recent positive selection test a neutral model against (strong)
selection for a co-dominant allele. Many such methods use information about
the expected site frequency spectrum under a hitchhiking model to detect
signatures of positive selection (e.g. Kim and Stephan 2002; Nielsen et al. 2005;
Jensen et al. 2005). Simpler approaches use test statistics such as sample
heterozygosity (usually called $\pi$) or Tajima's $D$ to reject a standard neutral
model. Simulation results by Teshima and Przeworski (2006) and Teshima
et al. (2006) show that the false-discovery and false-negative rates of such
methods increase if selection acts on a recessive rather than a co-dominant
allele.

Although adaptations are often assumed to be dominant rather than recessive
(Charlesworth 1998), recessive beneficial alleles are also well documented in
the empirical literature. Many of the cases that have been described concern
resistance alleles, where the loss of function of a gene conveys resistance to
a pathogen. Often only the homozygous mutant is resistant, leading to a
recessive sweep. Examples include: (i) The Duffy blood group locus in humans,
where the homozygous null allele (FY-0) confers complete resistance to vivax
malaria. Hamblin and Di Rienzo (2000) report that the FY-0 genotype is at or
near fixation in most sub-Saharan African populations but is very rare outside
Africa, which suggests that it is locally under strong positive selection.
(ii) Resistance to the yellow mosaic virus disease in barley has been mapped
to several recessive resistance genes (Ordon et al. 2004). (iii) The plant gene
eIF4E (present e.g. in pepper, pea and tomato) is a factor involved in basic
cellular processes and can be used by viruses to complete their life cycle.
Only if the function of both gene copies is compromised is the plant resistant,
so that positive selection can act (Cavatorta et al. 2008). (iv) The yellow fever
mosquito Aedes aegypti is resistant to the drug permethrin if both copies carry
a replacement mutation in the gene para, as shown in Garcia et al. (2009).
Starting with the original publication by Maynard Smith and Haigh (1974),
an extensive body of analytical theory has been established for the hitchhiking
model (e.g. Kaplan et al. 1989; Stephan et al. 1992; Barton 1998; Durrett and
Schweinsberg 2004; Etheridge et al. 2006). In addition to results on reduced
diversity and the frequency spectrum, linkage disequilibrium has been studied
by Stephan et al. (2006), McVean (2007), Pfaffelhuber and Studeny (2007) and
Leocard and Pardoux (2010). Moreover, the model was extended to the case
of multiple origins of the beneficial allele due to mutation or from standing
genetic variation (Pennings and Hermisson 2006a; Hermisson and Pennings
2005; Pennings and Hermisson 2006b; Hermisson and Pfaffelhuber 2008). All
these results are built around the simplest possible scenario for adaptation,
where positive selection acts on a single locus without dominance. Despite
its empirical importance, dominance has only been studied quite recently.
Teshima and Przeworski (2006) use computer simulations to demonstrate the
impact of intermediate dominance on the most important summary statistics
for the frequency spectrum; see also Teshima et al. (2006). Explicit analytical
results are even more sparse and only exist for the fixation time (duration of
the sweep) of the beneficial allele (van Herwaarden and van der Wal 2002). The
case of a completely recessive beneficial allele, in particular, has not been
treated in any of these publications.

The goal of our investigation is the extension of previous analytical results
to the case of arbitrary dominance. We focus, in particular, on the completely
recessive case. While sweeps with intermediate dominance, $0 < h < 1$,
and selection coefficient $s$ produce diversity patterns similar to a co-dominant
beneficial allele with selection coefficient $2hs$ (Theorem 4), our results show
that recessive sweeps, $h = 0$, are qualitatively different (Theorem 2). Also,
for the probability of fixation and the duration of the sweep, the recessive
case differs widely from other modes of dominance. See Proposition 3.1
and Theorem 1 for the recessive case and Proposition 4.1 and Theorem 3 for
$0 < h \le 1$.

The paper is organized as follows: in Section 2 we introduce the model
for selective sweeps with arbitrary dominance coefficient, both at the selected
locus (Subsection 2.1) and at the neutral locus (Subsection 2.2). In Section 3
we give our results on sweeps of recessive alleles. Section 4 contains our results
for sweeps in the cases $0 < h \le 1$. In Section 5 we describe sequence diversity
patterns under recessive sweeps using simulations and compare them with the
case $0 < h < 1$. We conclude with the proofs in Section 6.

2 The model 

We use discrete (Wright-Fisher) models as well as diffusion processes for
modeling allelic frequency paths (Section 2.1). In order to study genetic
diversity patterns, we use a structured coalescent (Section 2.2).



2.1 The allelic frequency path 

Consider a one-locus Wright-Fisher model, consisting of N diploid (and hence 
2N haploid) individuals. The beneficial mutant is B and the wildtype allele is 
denoted b. The (relative) fitness of diploids is given as follows: 

Genotype            BB        Bb         bb

Relative fitness    1 + s     1 + sh     1

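We are interested in the dynamics of $(X^N_t)_{t=0,1,2,\dots}$, where $X^N_t$ is the
frequency of the beneficial allele B in generation $t$. This process is a Markov
chain and, given $X^N_t = \frac{i}{2N}$, the transition probabilities are

\[
\mathbf P\Big[X^N_{t+1} = \tfrac{j}{2N} \,\Big|\, X^N_t = \tfrac{i}{2N}\Big]
  = \binom{2N}{j}\, \tilde p_i^{\,j}\, (1 - \tilde p_i)^{2N-j},
\]

where

\[
\tilde p_i = \frac{i^2(1 + s) + i(2N - i)(1 + sh)}
                  {i^2(1 + s) + 2i(2N - i)(1 + sh) + (2N - i)^2}.
\]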
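The Wright-Fisher dynamics just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sampling_probability`, `next_generation` and `simulate_path` are hypothetical, the binomial draw is done by summing Bernoulli trials (adequate for small $N$ only), and the sampling probability is the fraction given above.

```python
import random


def sampling_probability(i, N, s, h):
    """Frequency of allele B after selection when i of the 2N gene copies are B.

    Fitnesses are 1+s (BB), 1+s*h (Bb) and 1 (bb), as in the fitness table.
    """
    two_n = 2 * N
    numerator = i * i * (1 + s) + i * (two_n - i) * (1 + s * h)
    denominator = (i * i * (1 + s)
                   + 2 * i * (two_n - i) * (1 + s * h)
                   + (two_n - i) ** 2)
    return numerator / denominator


def next_generation(i, N, s, h, rng):
    """Draw the number of B copies in the next generation (binomial sampling)."""
    p = sampling_probability(i, N, s, h)
    return sum(1 for _ in range(2 * N) if rng.random() < p)


def simulate_path(N, s, h, rng, start=1):
    """Follow the number of B copies until loss (0) or fixation (2N)."""
    i = start
    path = [i]
    while 0 < i < 2 * N:
        i = next_generation(i, N, s, h, rng)
        path.append(i)
    return path
```

For example, `simulate_path(20, 0.05, 0.0, random.Random(1))` returns one trajectory of a recessive allele started from a single copy; most such trajectories end in loss.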
For $N \to \infty$, $s \to 0$ such that $2Ns \to \alpha$, and $X^N_0 \Rightarrow X_0$, it is well-known
(e.g. Ewens 2004) that $(X^N_{\lfloor 2Nt \rfloor})_{t \ge 0} \Rightarrow (X_t)_{t \ge 0}$, where '$\Rightarrow$' means convergence
in distribution (in the space of real-valued functions, equipped with uniform
convergence on compacts), and where $X := (X_t)_{t \ge 0}$ is the diffusion uniquely
determined by the stochastic differential equation (SDE) given in (2.1).



Definition 2.1 (Allelic frequency path) The diffusion $X = (X_t)_{t \ge 0}$ is the
unique solution of the SDE

\[
dX = \alpha\,(h + X(1 - 2h))\,X(1 - X)\,dt + \sqrt{X(1 - X)}\,dW, \tag{2.1}
\]

where $W$ is a standard Brownian motion. We set

\[
T_0 := \inf\{t \ge 0 : X_t = 0\}, \qquad T_1 := \inf\{t \ge 0 : X_t = 1\}, \tag{2.2}
\]

which are the times of loss and fixation of the beneficial allele, respectively.
Moreover, $X^* = (X^*_t)_{t \ge 0}$ is the process $X$, conditioned on the event $\{T_1 < T_0\}$.
We set

\[
T^* := \inf\{t \ge 0 : X^*_t = 1\}, \tag{2.3}
\]

which is the time of fixation of the beneficial allele. If not mentioned otherwise,
we assume that $X^*_0 = 0$ and $X^*$ arises as limit of conditioned processes which
are started in $\varepsilon$ as $\varepsilon \to 0$. For $\alpha, h \in \mathbb R$, we denote the distribution of $X$, started
in $X_0 = x$, by $\mathbf P^{\alpha,h}_x[.]$. Expectations and variances are denoted by $\mathbf E^{\alpha,h}_x[.]$ and
by $\mathbf V^{\alpha,h}_x[.]$, respectively.

Remark 2.2 (Fixation probability for a single mutant) Let $\mathbf P^{N,s,h}_x[.]$ be
the probability measure for the Wright-Fisher model with population size $N$,
selection coefficient $s$ and dominance $h$, started in $X^N_0 = x$. Note that the
weak convergence $(X^N_{\lfloor 2Nt \rfloor})_{t \ge 0} \Rightarrow (X_t)_{t \ge 0}$ does not imply convergence for all
interesting functionals of the Wright-Fisher model. In particular, convergence
of fixation probabilities for a single mutant in the sense

\[
\frac{\mathbf P^{N,s_N,h}_{1/(2N)}[X^N_\infty = 1]}{\mathbf P^{\alpha,h}_{1/(2N)}[X_\infty = 1]}
  \xrightarrow{N \to \infty} 1
\]

with $2Ns_N \to \alpha$ has only been proved in the case $h = 1/2$ (see Bürger
and Ewens 1995, p. 565). For this reason, all our assertions are stated in the
diffusion framework.
2.2 The structured coalescent

Consider a sample of size $n$, taken from the population at the time of fixation
of the beneficial allele. From each of these individuals, consider homologous
neutral loci, which are linked to the beneficial allele at recombination distance
$\rho$ (i.e. the probability that a recombination occurs between the beneficial and
the neutral locus is $r$ per generation in the Wright-Fisher model of diploid size
$N$, and $2Nr \to \rho$ as $N \to \infty$). In order to study the genetic diversity within
these neutral loci, we follow Kaplan et al. (1989) and Barton et al. (2004), and
introduce the following coalescent process, which is conditioned on a path of
$X^*$.

Definition 2.3 (The structured coalescent $\mathcal K$) Let $X^*$ be as in Definition
2.1. Conditioned on $X^*$, we define a Markov process $\mathcal K = (\mathcal K_\beta)_{0 \le \beta \le T^*} =
(K^B_\beta, K^b_\beta)_{0 \le \beta \le T^*}$ with $\beta = T^* - t$. Here, $K^B_\beta$ and $K^b_\beta$ are the number of lines in
the beneficial and wild-type background at time $\beta$, respectively. Taking values in
$\mathbb Z_+^2$, this process starts in $(K^B_0, K^b_0) = (n, 0)$ for some $n \in \mathbb N$. If $\mathcal K_\beta = (k^B, k^b)$,
then there are transitions to

1. $(k^B - 1, k^b)$ at rate $\binom{k^B}{2} \frac{1}{X^*_{T^* - \beta}}$
   (two lines in the beneficial background are merged to a single line)

2. $(k^B, k^b - 1)$ at rate $\binom{k^b}{2} \frac{1}{1 - X^*_{T^* - \beta}}$
   (two lines in the wildtype background are merged to a single line)

3. $(k^B - 1, k^b + 1)$ at rate $k^B \rho\,(1 - X^*_{T^* - \beta})$
   (one line in the beneficial background changes to the wildtype background)

4. $(k^B + 1, k^b - 1)$ at rate $k^b \rho\, X^*_{T^* - \beta}$
   (one line in the wildtype background changes to the beneficial background).

See Figure 2.1 for an illustration of $\mathcal K$.
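The four transition rates of the structured coalescent can be collected in a small helper, which may be useful when simulating the process along a given frequency path. This is a sketch only: the function name `transition_rates` and the dictionary keys are hypothetical, and `x` stands for the current value of the conditioned frequency path at time $T^* - \beta$.

```python
from math import comb


def transition_rates(k_B, k_b, x, rho):
    """Rates of the four transitions out of state (k_B, k_b).

    k_B, k_b: number of lines in the beneficial / wildtype background
    x:        frequency of the beneficial allele, strictly between 0 and 1
    rho:      scaled recombination rate between selected and neutral locus
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    return {
        "coalescence_B": comb(k_B, 2) / x,          # transition 1
        "coalescence_b": comb(k_b, 2) / (1.0 - x),  # transition 2
        "B_to_b": k_B * rho * (1.0 - x),            # transition 3
        "b_to_B": k_b * rho * x,                    # transition 4
    }
```

At low frequency `x` the coalescence rate in the beneficial background becomes large, which is the mechanism behind the increased coalescence during recessive sweeps discussed later.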



3 Results on recessive alleles 

In this section, we focus on the case $h = 0$, i.e. on properties of the diffusion
$X = (X_t)_{t \ge 0}$ given by the SDE

\[
dX = \alpha X^2 (1 - X)\,dt + \sqrt{X(1 - X)}\,dW \tag{3.1}
\]

and the corresponding diffusion $X^*$, which is conditioned to hit 1 (recall from
Definition 2.1). It is crucial to note that the process $X^*$ spends most of its
time at low frequencies for $h = 0$; see Figure 3.1(A). The reason is that at
low frequency, most beneficial alleles are found in heterozygotes, so that
selection cannot be efficient. In order to make this statement more quantitative,
we will show (see Lemma 6.3) that $(\sqrt{\alpha}\, X_{t/\sqrt{\alpha}})_{t \ge 0}$ converges to the diffusion
$(Y_t)_{t \ge 0}$ given by $dY = Y^2\,dt + \sqrt{Y}\,dW$. This implies that $X^*$ spends most of
its time in frequencies of order $1/\sqrt{\alpha}$. See also Figure 3.1(B) for an illustration
of the process $(\sqrt{\alpha}\, X^*_{t/\sqrt{\alpha}})_{t \ge 0}$.



G. Ewing et al. 




Fig. 2.1 Illustration of the structured coalescent in the case h = 0. Pairs of lines can 
coalesce within the beneficial background (event 1.), or within the wildtype background 
(event 2.). Changes from the beneficial to the wildtype background (event 3.) or vice versa 
(event 4.) may occur as well. 
Fig. 3.1 The frequency path of a beneficial recessive allele, conditioned on fixation, $X^*$.
We have used a Wright-Fisher model with $N = 5 \cdot 10^4$ and $s = 5 \cdot 10^{-3}$, i.e. $2Ns = 500$.
The right figure shows the rescaled frequency path $(\sqrt{\alpha}\, X^*_{t/\sqrt{\alpha}})_{t \ge 0}$.



We give our three main results on the fixation probability (Proposition 3.1,
actually already derived by Kimura (1962)), the duration of the recessive
sweep (Theorem 1) and the structured coalescent (Theorem 2). All proofs are
found in Section 6.1.



3.1 Fixation probability 



The following is a classical fact, which can be read off from Kimura (1962),
equation (14). It describes the fixation probability of the beneficial allele and
is stated here for completeness.
Proposition 3.1 (Fixation probability) Let $X$ be as in Definition 2.1,
$\alpha \to \infty$ and $\varepsilon_\alpha$ be such that $\varepsilon_\alpha \sqrt{\alpha} \to 0$. Then,^1

\[
\mathbf P^{\alpha,h=0}_{\varepsilon_\alpha}[T_1 < T_0] \approx 2\varepsilon_\alpha \sqrt{\frac{\alpha}{\pi}}. \tag{3.2}
\]

Remark 3.2 (Fixation probability in a finite population) Consider a
finite population of diploid size $N$. Assume that $X_0 = \frac{1}{2N}$, meaning that there
is only a single copy of the beneficial allele at time 0. For $\alpha = 2Ns$, we find
that, as $N \to \infty$,

\[
\mathbf P^{\alpha,h=0}_{1/(2N)}[T_1 < T_0] \approx \frac{1}{N}\sqrt{\frac{2Ns}{\pi}} = \sqrt{\frac{2s}{\pi N}}.
\]

(This is exactly equation (15) in Kimura, 1962.) Hence, a new recessive
beneficial allele has a chance to be fixed in the population which is much larger
than $1/(2N)$, its chance if it were neutral, and much smaller than $2hs$, its chance
if it were not recessive (compare with Remark 4.2). However, note that
it is not shown that the fixation probability for a single copy of the beneficial
allele in a Wright-Fisher model has the same limit behavior; compare with
Remark 2.2.



3.2 Length of the recessive sweep 

Now, we come to the results on the duration of the recessive sweep. 



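Theorem 1 (Length of the recessive sweep) Let $T^*$ be as in (2.3). Then,
as $\alpha \to \infty$,

\[
\mathbf E^{\alpha,h=0}_0[T^*] \approx \frac{4 C_{cat}}{\sqrt{\pi}} \frac{1}{\sqrt{\alpha}} + \frac{3 \log \alpha}{2\alpha}, \tag{3.3}
\]

\[
\mathbf V^{\alpha,h=0}_0[T^*] \approx \frac{c}{\alpha}, \tag{3.4}
\]

where $C_{cat} \approx 0.916$ is Catalan's constant and some $0 < c < \infty$.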
Remark 3.3 (Further investigation of $T^*$)

1. We find from (3.3) that

\[
\mathbf E^{\alpha,h=0}_0[T^*] \approx \frac{2.067}{\sqrt{\alpha}} + \frac{3 \log \alpha}{2\alpha}
\]

for large values of $\alpha$. For a finite population of size $N$ and a beneficial
recessive allele with selection coefficient $s$, this means that it takes
$2.067 \sqrt{N/s} + 3\log(2Ns)/(2s)$ generations until fixation of the beneficial
allele on average. As simulations show, this gives accurate numerical results;


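^1 For two functions $a_\alpha$ and $b_\alpha$ we write $a_\alpha \approx b_\alpha$ (as $\alpha \to \infty$) iff $\lim_{\alpha \to \infty} a_\alpha / b_\alpha = 1$.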
              $0 < X_t < \varepsilon$                                    $\varepsilon < X_t < 1-\varepsilon$    $1-\varepsilon < X_t < 1$

$h = 0$       $\frac{4C_{cat}}{\sqrt{\pi\alpha}} + \frac{\log\alpha}{2\alpha}$       $O(1/\alpha)$             $\frac{\log\alpha}{\alpha}$

$0 < h < 1$   $\frac{\log\alpha}{h\alpha}$                               $O(1/\alpha)$             $\frac{\log\alpha}{(1-h)\alpha}$

$h = 1$       $\frac{\log\alpha}{\alpha}$                                $O(1/\alpha)$             $\frac{\pi^{3/2}}{2\sqrt{\alpha}} + \frac{\log\alpha}{2\alpha}$

Table 3.1 The leading terms for the expected duration of the time when the frequency of
the beneficial allele is low (i.e. smaller than some $\varepsilon > 0$), intermediate (between $\varepsilon$ and $1 - \varepsilon$)
and high (above $1 - \varepsilon$). In case the leading term can only be determined up to a factor, we
write $O(.)$.
Fig. 3.2 The duration of the sweep of a beneficial recessive allele is plotted against $\alpha = 2Ns$.
We have used a Wright-Fisher model with N = 10®. (A) Average length of the
duration of the recessive sweep. (B) Empirical variance of the duration of the recessive
sweep.



see Figure 3.2(A). In addition, a numerical investigation of (6.11) reveals
that

\[
\mathbf V^{\alpha,h=0}_0[T^*] \approx \frac{0.6362}{\alpha} \tag{3.5}
\]

for large $\alpha$. This value also fits well with simulations; see Figure 3.2(B).

2. Our calculations reveal not only limits for the expected total duration of
the sweep, but also of the duration of the allele being in low (below some
$\varepsilon > 0$, not depending on $\alpha$), intermediate (between $\varepsilon$ and $1 - \varepsilon$) and in
high (above $1 - \varepsilon$) frequency. These results are collected in Table 3.1.
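The numerical constant 2.067 in Remark 3.3 is $4 C_{cat}/\sqrt{\pi}$, which can be checked with a short computation. The sketch below is illustrative only: the function names are hypothetical, Catalan's constant is obtained from its alternating series $\sum_{k \ge 0} (-1)^k/(2k+1)^2$, and the returned sweep time is expressed on the diffusion time scale of (3.3), not in generations.

```python
import math


def catalan_constant(terms=200000):
    """Partial sum of the alternating series sum_k (-1)^k / (2k+1)^2."""
    return sum((-1) ** k / (2 * k + 1) ** 2 for k in range(terms))


def expected_recessive_sweep_time(alpha):
    """Leading terms of (3.3) for h = 0, on the diffusion time scale."""
    c_cat = catalan_constant()
    return (4 * c_cat / math.sqrt(math.pi * alpha)
            + 3 * math.log(alpha) / (2 * alpha))


# Coefficient of 1/sqrt(alpha) quoted in Remark 3.3.
leading_coefficient = 4 * catalan_constant() / math.sqrt(math.pi)
```

Evaluating `expected_recessive_sweep_time(500)` corresponds to the parameter choice $2Ns = 500$ used for Figure 3.1.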
Remark 3.4 (Deterministic approach) In the co-dominant case, the fixation
process is often approximated by a logistic increase of the beneficial allele;
e.g. Kaplan et al. (1989); Stephan et al. (1992). Formally, this approximation
replaces the SDE (2.1) by the corresponding ODE. Calculating the duration
of the sweep as the time to reach $1 - \frac{1}{2N}$ from $\frac{1}{2N}$ under the logistic increase
is a rough, but acceptable first approximation for the fixation time. This is
different for the completely recessive case, where the corresponding ODE reads

\[
\dot y = \alpha y^2 (1 - y), \qquad y_0 = \frac{1}{2N}.
\]

The time $t^*$ to reach $y_{t^*} = 1 - \frac{1}{2N}$ from an initial frequency $\frac{1}{2N}$ is easily derived:

\[
t^* = \frac{1}{\alpha}\Big( 2\log(2N - 1) + 2N - \frac{2N}{2N - 1} \Big) \approx \frac{2N}{\alpha} \quad \text{as } N \to \infty.
\]

Note that the asymptotic behavior for $\alpha \to \infty$ of this expression differs
considerably from the scaling $T^* \sim \alpha^{-1/2}$ in (3.3) that we have found from the
full stochastic equation.
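The closed form for $t^*$ follows from separating variables in $\dot y = \alpha y^2(1-y)$ and using the partial fractions $\frac{1}{y^2(1-y)} = \frac{1}{y} + \frac{1}{y^2} + \frac{1}{1-y}$. The following sketch compares that closed form with a direct numerical integration of $dt = dy / (\alpha y^2 (1-y))$; the function names are hypothetical and Simpson's rule is just one convenient choice of quadrature.

```python
import math


def deterministic_sweep_time(alpha, N):
    """Closed-form time for the ODE to go from 1/(2N) to 1 - 1/(2N)."""
    two_n = 2 * N
    return (2 * math.log(two_n - 1) + two_n - two_n / (two_n - 1)) / alpha


def deterministic_sweep_time_numeric(alpha, N, steps=200000):
    """Simpson integration of dt = dy / (alpha * y^2 * (1 - y)); steps must be even."""
    y0 = 1.0 / (2 * N)
    y1 = 1.0 - y0

    def integrand(y):
        return 1.0 / (alpha * y * y * (1.0 - y))

    width = (y1 - y0) / steps
    total = integrand(y0) + integrand(y1)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(y0 + i * width)
    return total * width / 3.0
```

For fixed $N$ the result is proportional to $1/\alpha$, in contrast to the $\alpha^{-1/2}$ scaling of the stochastic sweep time.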



3.3 The structured coalescent 

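In this section, we study the structured coalescent from Section 2.2 for $h = 0$.

Theorem 2 (The structured coalescent for $h = 0$) Let $\mathcal K = (K^B_\beta, K^b_\beta)_{0 \le \beta \le T^*}$
be as in Definition 2.3 and $\rho = \rho_\alpha$ such that $\rho/\sqrt{\alpha} \to \lambda$ for some
$0 < \lambda < \infty$ as $\alpha \to \infty$. Then,

\[
\mathbf P^{\alpha,h=0}[\mathcal K_{T^*} = (1,0) \mid \mathcal K_0 = (1,0)] \approx \mathbf E^{\alpha,h=0}\big[e^{-\lambda \sqrt{\alpha}\, T^*}\big], \tag{3.6}
\]

\[
\mathbf P^{\alpha,h=0}[K^B_{T^*} + K^b_{T^*} = k \mid \mathcal K_0 = (n,0)] \approx c_{\lambda,n,k} \tag{3.7}
\]

for some $0 < c_{\lambda,n,k} < 1$, not depending on $\alpha$.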


Remark 3.5 (Approximation of the genealogy for $h = 0$) It is important
to note that only the scaling of the recombination rate $\rho$ by $\sqrt{\alpha}$ leads
to a non-trivial limit result in (3.6) and (3.7). In applications, however, finite
values of $\alpha$ and $\rho$ must be assumed. In order to use the above Theorem for
large $\alpha$, we set $\lambda = \rho/\sqrt{\alpha}$. In this case, the Theorem implies that every single
line changes from the beneficial to the wildtype background approximately at
rate $\rho = \lambda\sqrt{\alpha}$ (see (3.6)) during the sweep. A first naive guess about the
genealogy at the neutral locus is a star-like approximation as in the case of
co-dominant alleles (Durrett and Schweinsberg 2004; see also Section 4.2).
This means that all lines recombine to the wild-type background (transition
3.) independently at rate $\rho$ and those which did not recombine coalesce at the
beginning of the sweep. However, $X^*$ spends most time in frequencies of order
$1/\sqrt{\alpha}$, leading to an increased rate of coalescence during the sweep. We
suggest that the process $\mathcal K$ can be approximated by a coalescence process (which
is not conditioned on $X^*$) where two lines merge at rate $\sqrt{\alpha}$ and each line
escapes the sweep at rate $\rho$. As we will see in Section 5, this approximation (in
contrast to the star-like approximation) produces reasonable numerical results
and helps to explain some of the qualitative differences in frequency spectra of
recessive and co-dominant sweeps. Note, however, that the approximation is
not a formal convergence result. (Due to the fluctuations in $X^*$ around $1/\sqrt{\alpha}$
such a formal result is not easily obtained.)

Remark 3.6 (Reduction of diversity and $c_{\lambda,2,2}$) The most prominent
attribute of a selective sweep is the reduction in sequence diversity, which can
be quantified by the average reduction in heterozygosity due to the sweep. Let
$H_2(t)$ be the heterozygosity at a neutral locus linked to the selected site. Then
(see equation (16) in Kaplan et al. 1989)

\[
\mathbf E^{\alpha,h=0}[H_2(T^*)] = p_{22} \cdot \mathbf E[H_2(0)],
\]

for

\[
p_{22} := \mathbf P^{\alpha,h=0}[K^B_{T^*} + K^b_{T^*} = 2 \mid \mathcal K_0 = (2,0)] \approx c_{\lambda,2,2}.
\]

Indeed, assuming that there are no new mutations in $[0, T^*]$, two neutral loci
picked at the end of the selective sweep are different if the ancestors at the
beginning of the sweep are different (probability $p_{22}$) and if these carry
different alleles (probability $\mathbf E[H_2(0)]$). In particular, $p_{22}$ captures the reduction of
diversity within recessive sweeps. Using the approximate genealogy suggested
in the last remark and a competing Poisson process argument, we find

\[
p_{22} \approx \frac{2\rho}{\sqrt{\alpha} + 2\rho} = \frac{2\lambda}{1 + 2\lambda}. \tag{3.8}
\]

We will see in Section 5 that this approximation produces a reasonable fit
to simulations; see Figure 5.1(A). Note that the star-like approximation
(described above) would lead to $p_{22} \approx \mathbf E[1 - \exp(-2\rho T^*)]$, and hence would predict
a reduction in sequence diversity which is weaker than seen in simulations (not
shown, but compare with Figure 5.1(A)).
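The competing Poisson process argument behind (3.8) can be illustrated numerically: the pair of lines coalesces after an exponential time with rate $\sqrt{\alpha}$, while the first of the two lines escapes by recombination after an exponential time with rate $2\rho$. The sketch below, with hypothetical function names, compares the closed form with a Monte Carlo estimate under exactly these heuristic assumptions; it is not a simulation of the full structured coalescent.

```python
import math
import random


def p22_competing_poisson(rho, alpha):
    """Closed form (3.8): probability that escape happens before coalescence."""
    return 2.0 * rho / (math.sqrt(alpha) + 2.0 * rho)


def p22_monte_carlo(rho, alpha, trials, rng):
    """Estimate the same probability by sampling the two exponential clocks."""
    escapes = 0
    for _ in range(trials):
        coalescence_time = rng.expovariate(math.sqrt(alpha))
        escape_time = rng.expovariate(2.0 * rho)
        if escape_time < coalescence_time:
            escapes += 1
    return escapes / trials
```

With $\lambda = \rho/\sqrt{\alpha} = 1/2$, for instance `rho = 10` and `alpha = 400`, both functions should give values close to $2\lambda/(1 + 2\lambda) = 1/2$.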



4 Results on incompletely and completely dominant alleles 

In this section we give approximations for the fixation probability (Proposition
4.1) and the fixation time (Theorem 3) in the case $0 < h \le 1$.



Fixation probability 

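The next result is the complement of Proposition 3.1 for the case $0 < h \le 1$.

Proposition 4.1 (Fixation probability) Let $0 < h \le 1$, and $\varepsilon_\alpha$ be such
that $\varepsilon_\alpha \alpha \to 0$ for $\alpha \to \infty$. Then,

\[
\mathbf P^{\alpha,h}_{\varepsilon_\alpha}[T_1 < T_0] \approx 2h\alpha\varepsilon_\alpha. \tag{4.1}
\]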
Remark 4.2 (Fixation probability in a finite population) For a finite
population of haploid size $2N$, and if the homozygote for the beneficial allele
has selective advantage $\alpha = 2Ns$, the above result means that

\[
\mathbf P^{\alpha,h}_{1/(2N)}[T_1 < T_0] \approx 2hs.
\]

Hence, the fixation probability of an incompletely dominant allele with
dominance $h$ and advantage $s$ is approximately the same as for a co-dominant allele
with advantage $2hs$.
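The approximations (3.2) and (4.1) can be compared with a numerical evaluation of the fixation probability of the diffusion (2.1). The sketch below relies on an assumption not spelled out in this section, namely the standard scale-function representation from diffusion theory (the form used by Kimura 1962): the fixation probability from $x_0$ is $\int_0^{x_0} G / \int_0^1 G$ with $G(x) = \exp(-\alpha(2hx + (1-2h)x^2))$. Function names are hypothetical and Simpson's rule is used for the integrals.

```python
import math


def scale_density(x, alpha, h):
    """Integrand G(x) = exp(-alpha * (2*h*x + (1 - 2*h) * x^2))."""
    return math.exp(-alpha * (2.0 * h * x + (1.0 - 2.0 * h) * x * x))


def simpson(f, a, b, n=20000):
    """Composite Simpson rule with an even number n of subintervals."""
    step = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * step)
    return total * step / 3.0


def fixation_probability(x0, alpha, h):
    """Probability that the diffusion started in x0 hits 1 before 0."""
    def g(x):
        return scale_density(x, alpha, h)
    return simpson(g, 0.0, x0) / simpson(g, 0.0, 1.0)


def recessive_approximation(x0, alpha):
    """Right-hand side of (3.2)."""
    return 2.0 * x0 * math.sqrt(alpha / math.pi)


def dominance_approximation(x0, alpha, h):
    """Right-hand side of (4.1)."""
    return 2.0 * h * alpha * x0
```

For $\alpha = 1000$ the two approximations agree well with the numerical value as long as the starting frequency satisfies the smallness conditions of the respective proposition.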



4.1 Length of the sweep 

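Next, we give our results on the duration of the sweep for $0 < h \le 1$.

Theorem 3 (Length of the sweep)

1. Let $0 < h < 1$ and $T^*$ be as in (2.3). Then,

\[
\mathbf E^{\alpha,h}_0[T^*] \approx \frac{\log(2\alpha)}{\alpha h(1 - h)}
  + \frac{\gamma - 1 + (2 - 3h)\log h + (3h - 1)\log(1 - h)}{\alpha h(1 - h)}, \tag{4.2}
\]

where $\gamma$ is Euler's constant, and

\[
\mathbf V^{\alpha,h}_0[T^*] \approx \frac{1}{\alpha^2}\Big( \frac{c'}{h^2} + \frac{c''}{(1 - h)^2} \Big)
\]

for some $0 < c', c'' < \infty$, not depending on $h$.

2. For $h = 1$,

\[
\mathbf E^{\alpha,h=1}_0[T^*] \approx \frac{\pi^{3/2}}{2\sqrt{\alpha}} + \frac{3\log\alpha}{2\alpha} \tag{4.3}
\]

and

\[
\mathbf V^{\alpha,h=1}_0[T^*] \approx \frac{c'''}{\alpha}
\]

for some $0 < c''' < \infty$.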
Remark 4.3 (Further investigation of $T^*$)

1. The approximate expected duration of a sweep for $0 < h < 1$ from (4.2)
has already been obtained in van Herwaarden and van der Wal (2002)
(using other methods) and we only give these results for completeness.
Comparison with numerical results (Figure 4.1) shows that (4.2) is accurate
as long as $h\alpha \gg 1$ and $(1 - h)\alpha \gg 1$, i.e. dominance is intermediate. For
nearly recessive alleles ($h\alpha$ of order 1) or nearly dominant alleles ($(1 - h)\alpha$
of order 1), however, $T^*$ is much better approximated by the formulas for
$h = 0$ (3.3) or $h = 1$ (4.3).

2. Although (4.2) is symmetric under the exchange of $h$ and $(1 - h)$, this
does not reflect an exact identity of the fixation time. As noted by van
Herwaarden and van der Wal (2002), an exact symmetry of the fixation
process under the Wright-Fisher diffusion is obtained under the map
$(\alpha, h) \mapsto (-\alpha, 1 - h)$, i.e. if fixation of a dominant beneficial allele is
compared with a recessive deleterious one. For beneficial alleles, the approximate
symmetry under $h \mapsto (1 - h)$ becomes much worse for nearly recessive
or dominant alleles, as already observed by Teshima and Przeworski (2006).
From our formulas for the fixation times of completely recessive and
dominant alleles, (3.3) and (4.3), this asymmetry is most obvious. In particular,
the expected length of a dominant sweep ($h = 1$)

\[
\mathbf E^{\alpha,h=1}_0[T^*] \approx \frac{2.7841}{\sqrt{\alpha}} + \frac{3\log\alpha}{2\alpha}
\]

is much longer than any sweep of an allele with the same homozygous
advantage but $1 > h \ge 0$. Most of this time is spent near frequency 1. For
the variance, numeric integration in the last line of (6.19) results in

\[
\mathbf V^{\alpha,h=1}_0[T^*] \approx \frac{1.818}{\alpha}, \tag{4.4}
\]

which is also larger than for $h = 0$. See Figure 4.2 for comparison with
simulations.

3. As for the recessive case, our calculations also reveal the expected times
for three phases of the sweep, with the allele being in low, intermediate
and high frequency (cf. Table 3.1).



4.2 The structured coalescent 



The approximation of the coalescent at the neutral locus in a sweep region
by a star-like genealogy is widely used. A formal convergence result for the
co-dominant case is due to Durrett and Schweinsberg (2004). Using our results
from Section 6, this result can be extended to the general non-recessive case
with $0 < h < 1$.



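Theorem 4 (The structured coalescent for $0 < h < 1$) Let $0 < h < 1$,
$\mathcal K = (K^B_\beta, K^b_\beta)_{0 \le \beta \le T^*}$ be as in Definition 2.3 and $\rho = \rho_\alpha$ such that
$\rho \log\alpha / \alpha \to \lambda$ for some $0 < \lambda < \infty$ as $\alpha \to \infty$. Then,

\[
\mathbf P^{\alpha,h}[\mathcal K_{T^*} = (1,0) \mid \mathcal K_0 = (1,0)] \xrightarrow{\alpha \to \infty} e^{-\lambda/h}, \tag{4.5}
\]

\[
\mathbf P^{\alpha,h}[K^B_{T^*} + K^b_{T^*} = 2 \mid \mathcal K_0 = (2,0)] \xrightarrow{\alpha \to \infty} 1 - e^{-2\lambda/h}. \tag{4.6}
\]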


Remark 4.4 (Star-like genealogy, scaling and dominant alleles)

1. Again, it is crucial to note that a non-trivial limit for the structured
coalescent arises only by the scaling of $\rho$ by $\alpha/\log\alpha$.
Fig. 4.1 The duration of the sweep of a beneficial allele is plotted against $\alpha$. We have used
the same parameters (except for dominance) as in Figure 3.2. (A) $h = 0.1$, (B) $h = 0.9$. For
$h\alpha < 10$ (in (A), the nearly recessive case) and $(1 - h)\alpha < 10$ (in (B), the nearly dominant
case), the simulation curve crosses over to the predicted sweep times for $h = 0$ (3.3) and
$h = 1$ (4.3), respectively.

Fig. 4.2 The duration of the sweep of a beneficial dominant allele, $h = 1$, is plotted
against $\sqrt{\alpha}$. We have used the same parameters (except for dominance) as in Figure 3.2.
(A) expectation, (B) variance.



2. The Theorem must be read as follows: any single line recombines out of
the sweep (i.e. experiences a transition 3., compare with Definition 2.3)
with approximate probability $1 - e^{-\lambda/h}$. Moreover, any two lines behave
independently and coalesce if and only if both lines do not recombine. In
particular, this interpretation shows the star-like approximation for the
genealogy at the neutral locus.

3. It is now possible to compare the reduction in diversity of a beneficial allele
with dominance $h$ and selection coefficient $\alpha_h$ and a co-dominant beneficial
allele with selection coefficient $\alpha_{1/2}$. As the Theorem shows, the reduction
in diversity only depends on $\rho\log\alpha/(\alpha h)$. Hence, ignoring logarithmic
terms, the diversity reduction is approximately the same if $2h\alpha_h = \alpha_{1/2}$. In
other words, the effect of a beneficial allele with selection coefficient $\alpha$ and
dominance $h$ is similar to that of a co-dominant one with selection coefficient
$2h\alpha$.

4. We can approximate the expected reduction in heterozygosity at the end
of a sweep in the case $0 < h < 1$. Using the same notation as in Remark
3.6 with $\rho\log\alpha/\alpha \to \lambda$ as in the above result, we find that

\[
\mathbf E^{\alpha,h}[H_2(T^*)] \approx (1 - e^{-2\lambda/h}) \cdot \mathbf E^{\alpha,h}[H_2(0)]
  \approx (1 - e^{-2\rho\log\alpha/(\alpha h)}) \cdot \mathbf E^{\alpha,h}[H_2(0)]. \tag{4.7}
\]

The proof uses (and establishes) the fact that for $\alpha \to \infty$ recombination
and coalescence can only happen while the frequency of the beneficial allele
is small ($X_t < \varepsilon$, first phase in Table 3.1). Note that the result uses the
scaling of the recombination rate $\rho \sim \alpha/\log\alpha$. This is in stark contrast to
the recessive case where $\rho$ must be of order $\sqrt{\alpha}$.

5. The above theorem does not cover the case $h = 1$. For such a dominant
sweep, recall from Theorem 3 that the duration of the sweep is of the order
of $1/\sqrt{\alpha}$ in the diffusion time scale. However, this order is due to the final
phase of the fixation process, where the frequency of the beneficial allele is
near 1. During this final phase, it is possible that additional recombination
events to the wild-type background occur, plus back-recombination events
to the beneficial background. In particular, (i) in the proof of Theorem 4
is not true in the case $h = 1$.

6. Our result shows that methods which rely on the star-like approximation
(such as Nielsen et al. (2005)) can easily be adapted to the case of
intermediate dominance. In contrast, they are expected to fail for recessive
sweeps, i.e. $h\alpha$ of order 1, even if selection is very strong. Note that the
scaling result for the star-like approximation uses the limit $\alpha \to \infty$ for
constant $h$ and should therefore only be applied to cases where $h\alpha \gg 1$.



5 Simulations 

In this section, we show simulation results indicating the implications of our
findings for data analysis. All simulations were done using the program msms
(Ewing and Hermisson, 2010). This program is able to simulate samples of
arbitrary size of homologous genetic material of any sequence length for
arbitrary selection scenarios on a single bi-allelic locus, including temporally or
spatially heterogeneous selection and dominance.

We concentrate our simulations on the reduction of sequence diversity at
the neutral locus. In particular, we compare the effects of selective sweeps
for recessive and co-dominant alleles. Since the same value for the selection
strength $\alpha$ will lead to a much weaker genetic footprint for the recessive case,
$h = 0$, as compared to $h = 0.5$, we give a heuristic argument in Section
5.1 that describes which values of $\alpha$ correspond to a footprint with a given
strength for recessive and co-dominant alleles. The resulting sequence diversity
is described in Section 5.2. A power analysis of Tajima's $D$, a common test
statistic to reject the standard neutral model, is given in Section 5.3.



5.1 Comparison of recessive and co-dominant sweeps 

The difference between recessive and co-dominant sweeps is best seen from the
approximations of the genealogy at the neutral locus; compare Remarks 3.6 and
4.4. In particular, we have seen that a star-like approximation can be used
for strong selection and $h > 0$ (see (4.7)), while the case of $h = 0$ is best
described by competing Poisson processes for recombination and coalescence
with constant rates; see (3.8). Moreover, note that (4.7) and (3.8) already
give first-order estimates for the reduction of heterozygosity due to a selective
sweep.

We can ask which selection strength $\alpha_0$ for a recessive allele is needed to
obtain the same expected reduction in heterozygosity as a co-dominant sweep
with a given selection coefficient $\alpha_{1/2}$. Equating (3.8) and (4.7),

\[
\frac{2\rho}{\sqrt{\alpha_0} + 2\rho} = 1 - e^{-4\rho\log\alpha_{1/2}/\alpha_{1/2}},
\]

we obtain the following condition for small $\rho$,

\[
\alpha_0 = \Big( \frac{\alpha_{1/2}}{2\log\alpha_{1/2}} \Big)^2. \tag{5.1}
\]

We note that this relation produces sweeps of almost identical total length.
Indeed, equating the leading order terms for the fixation times (3.3) and (4.2)
leads to a selection coefficient for the recessive case of about $1.07\,\alpha_0$, with $\alpha_0$
given in (5.1). In our simulations, we will use the particular pair $\alpha_{1/2} = 1000$
and $\alpha_0 = 5300$, which approximately fulfill (5.1).
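The correspondence (5.1) between the two selection strengths is easy to evaluate numerically. The following sketch, with hypothetical function names, computes $\alpha_0$ from $\alpha_{1/2}$ and compares the two first-order predictions for the relative heterozygosity, (3.8) for $h = 0$ and (4.7) with $h = 1/2$, at a small recombination rate.

```python
import math


def matching_recessive_alpha(alpha_half):
    """alpha_0 from (5.1): (alpha_half / (2 * log(alpha_half)))^2."""
    return (alpha_half / (2.0 * math.log(alpha_half))) ** 2


def relative_heterozygosity_recessive(rho, alpha_0):
    """Prediction (3.8) for h = 0."""
    return 2.0 * rho / (math.sqrt(alpha_0) + 2.0 * rho)


def relative_heterozygosity_codominant(rho, alpha_half):
    """Prediction (4.7) evaluated at h = 1/2."""
    return 1.0 - math.exp(-4.0 * rho * math.log(alpha_half) / alpha_half)
```

For $\alpha_{1/2} = 1000$ the function `matching_recessive_alpha` gives a value a little above 5200, in line with the rounded value 5300 used in the simulations.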



5.2 The reduction in sequence diversity 

A commonly used measure for sequence diversity is the nucleotide diversity,
which we denote by $\hat\theta_\pi$. For a sample of $n$ homologous sequences of neutral
loci,

\[
\hat\theta_\pi = \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} H_{ij}.
\]

Here, $H_{ij}$ is the number of differences between the $i$th and $j$th locus. Recall
that $\hat\theta_\pi$ is an unbiased estimator for the population mutation rate $\theta$ under the
standard neutral model (Tajima 1983).
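As a small illustration of this estimator, the sketch below computes the average number of pairwise differences for a list of aligned sequences. The function names are hypothetical, and sequences are assumed to be equal-length strings without gaps or missing data.

```python
from itertools import combinations


def pairwise_differences(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ (H_ij)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def nucleotide_diversity(sequences):
    """Average of H_ij over all n-choose-2 pairs of sequences."""
    n = len(sequences)
    if n < 2:
        raise ValueError("need at least two sequences")
    total = sum(pairwise_differences(a, b)
                for a, b in combinations(sequences, 2))
    return total / (n * (n - 1) / 2)
```

For the toy sample `["AAT", "ACT", "ACG"]` the three pairwise differences are 1, 2 and 1, so the estimator equals 4/3.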
Note that $\mathbf E[\hat\theta_\pi] = \mathbf E[H_2]$, where $H_2$ is the heterozygosity in the population.
Hence, we can use (4.7) and (3.8) for a prediction of $\hat\theta_\pi$ at the end of the
selective sweep. We have tested these predictions using coalescent simulations
under the infinite sites model of a stretch of DNA for a sample of size $n = 50$
for several pairs of selection coefficients according to (5.1). Results for the
particular pair $\alpha_{1/2} = 1000$ and $\alpha_0 = 5300$ are shown in Figure 5.1. The match
for the reduction in heterozygosity is good as long as $\rho \ll \alpha$ (compare the slope
for $\hat\theta_\pi$ near $\rho = 0$ for $h = 0$ and $h = 0.5$). For larger values of $\rho$, there are
several competing forces and all approximations show deviations. For $h = 0.5$,
the star-like approximation overestimates the length of the genealogy and gives
a valley of reduced heterozygosity that is too small. For $h = 0$, our Poisson process
approximation assumes that we can neglect fluctuations of the frequency of the
beneficial allele around $1/\sqrt{\alpha}$. The simulations show that the predicted valley of
reduced heterozygosity is too broad (see Figure 5.1(A)). Note that the star-like
approximation for $h = 0.5$ and the Poisson process approximation for $h = 0$
give errors of comparable magnitude.



5.3 The power of Tajima's D 

Another unbiased estimator for the population mutation rate $\theta$ is given by
Watterson's $\hat\theta_W$,

$$\hat\theta_W = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}},$$

where $S$ is the total number of single nucleotide polymorphisms (SNPs) in all
$n$ lines (Watterson 1975). In order to reject the neutral model, we consider the
frequently used statistic Tajima's $D$ (Tajima 1989). This statistic was one
of the first test statistics for neutrality tests and is proportional to
$\hat\theta_T - \hat\theta_W$. It is a classical result that Tajima's $D$ is
negative for selective sweeps.
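For concreteness, Watterson's estimator and Tajima's $D$ can be computed from the nucleotide diversity, the number of SNPs and the sample size. The sketch below uses the standard normalising constants from Tajima (1989); the function names are assumptions made for illustration, and the code is not the implementation used for the simulations reported here.

```python
import math


def harmonic_sums(n: int) -> tuple[float, float]:
    """Return a1 = sum_{i<n} 1/i and a2 = sum_{i<n} 1/i^2."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    return a1, a2


def watterson_theta(num_snps: int, n: int) -> float:
    """Watterson's estimator S / a1 for a sample of n sequences."""
    a1, _ = harmonic_sums(n)
    return num_snps / a1


def tajimas_d(theta_t: float, num_snps: int, n: int) -> float:
    """Tajima's D: (theta_T - theta_W) divided by its estimated standard deviation.

    Returns NaN when the statistic is undefined (no SNPs, or n < 4,
    where the variance constants vanish).
    """
    if n < 4 or num_snps == 0:
        return float("nan")
    a1, a2 = harmonic_sums(n)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * num_snps + e2 * num_snps * (num_snps - 1)
    return (theta_t - num_snps / a1) / math.sqrt(variance)


n, num_snps = 50, 20
theta_w = watterson_theta(num_snps, n)
# A sample whose pairwise diversity is only half of theta_W, as expected
# when low frequency polymorphisms are in excess, gives a negative D.
d_sweep_like = tajimas_d(0.5 * theta_w, num_snps, n)
d_neutral_like = tajimas_d(theta_w, num_snps, n)
print(d_sweep_like, d_neutral_like)
```

The sign of $D$ is determined entirely by the sign of $\hat\theta_T - \hat\theta_W$; the denominator only standardises the scale.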

The simulations show a shallower increase for $\hat\theta_W$ in the recessive case
(see Figure 5.1(A)). This implies a smaller difference between $\hat\theta_W$ and
$\hat\theta_T$ and hence a less negative value for Tajima's $D$ relative to a sweep
with $h = 0.5$, as seen in Figure 5.1(B). This finding is easily understood from
the approximated coalescent histories. With a star-like genealogy, only single
lines of descent recombine out of the sweep, leading to an excess of singletons
(more generally: low frequency polymorphisms) relative to the neutral
expectation. This bias in the frequency spectrum is indicated by a shift to
negative values of Tajima's $D$. In contrast, coalescence events will often
occur prior to recombination events under a competing Poisson process scheme,
valid for recessive sweeps. As a consequence, lines of descent that recombine
out of the sweep will often have multiple descendants among the sequences in
the sample. The shift towards low frequency polymorphisms is therefore less
pronounced and Tajima's $D$ less negative. Similar results have previously been
reported for moderately recessive sweeps ($h = 0.1$) by Teshima and Przeworski
(2006).

Fig. 5.1 (A) The average reduction of nucleotide diversity $\hat\theta_T$ and
Watterson's estimator $\hat\theta_W$ for the population mutation rate $\theta$
after a sweep. (B) Tajima's $D$ as a function of the recombination distance,
$\rho$, for co-dominant and recessive sweeps.



The approximate genealogy with constant coalescence and recombination
rates is equivalent to a model of a structured population with constant
migration between two islands representing the beneficial and the ancestral
background. Depending on the migration rate (and the sampling scheme) both a
surplus ($D < 0$) and a deficit ($D > 0$) of low frequency polymorphisms can
result under this scenario. For the recessive sweep, we see that the expected
Tajima's $D$ indeed turns positive at a larger recombination distance to the
selected site.

Since the deviation of Tajima's $D$ from zero (in either direction) is relatively
smaller for recessive sweeps, we expect that the power to detect selection
in a simple test based on $D$ is also reduced. This is confirmed by the power
table in Figure 5.2, consistent with previous results on partially recessive
sweeps by Teshima et al. (2006).



6 Proofs 



All our assertions deal with the case $\alpha \to \infty$. Hence, all '$\approx$' in our
proofs are to be read as asymptotic statements in the limit $\alpha \to \infty$.
Frequently, we make use of a sequence $(c_\alpha)_{\alpha > 0}$ of real numbers with
$c_\alpha \to \infty$ slowly as $\alpha \to \infty$. For example, we write



/O ^ J Co, ^ J Co. S 

Our proofs are based on classical one-dimensional diffusion theory; see e.g. 



Karlin and Taylor (1981); Ewens (2004). The solutions of the SDE as given in 



Fig. 5.2 The power of Tajima's $D$, depending on dominance, recombination distance and
time since completion of the selective sweep. (A) co-dominant and (B) recessive, parameters
as in Figure 5.1(B).



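(2.1) have infinitesimal mean and variance

$$\mu(x) = \alpha\big(h + x(1 - 2h)\big)x(1 - x), \qquad \sigma^2(x) = x(1 - x).$$

The scale function is

$$S(x) = \int_0^x \exp\big(-2h\alpha\xi - \alpha\xi^2(1 - 2h)\big)\,d\xi. \qquad (6.1)$$

Since 0 and 1 are exit boundaries for the diffusion, and for $T_0$ and $T_1$ as
defined in (2.2),

$$P_x[T_1 < T_0] = \frac{S(x)}{S(1)} = \frac{\int_0^x e^{-2h\alpha\xi - \alpha\xi^2(1 - 2h)}\,d\xi}{\int_0^1 e^{-2h\alpha\eta - \alpha\eta^2(1 - 2h)}\,d\eta}. \qquad (6.2)$$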
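The fixation probability $S(x)/S(1)$, with scale density $\exp(-2h\alpha\xi - \alpha\xi^2(1-2h))$, is easy to evaluate numerically. The following is a minimal sketch using a midpoint rule; the function names and the number of integration steps are assumptions chosen for illustration. For $h = 1/2$ the integrals are elementary, which provides a check.

```python
import math


def scale_density(xi: float, alpha: float, h: float) -> float:
    """Integrand of the scale function: exp(-2 h alpha xi - alpha xi^2 (1 - 2h))."""
    return math.exp(-2.0 * h * alpha * xi - alpha * xi * xi * (1.0 - 2.0 * h))


def scale_function(x: float, alpha: float, h: float, steps: int = 20000) -> float:
    """S(x) = integral of the scale density over [0, x], by the midpoint rule."""
    if x <= 0.0:
        return 0.0
    dx = x / steps
    return dx * sum(scale_density((k + 0.5) * dx, alpha, h) for k in range(steps))


def fixation_probability(x: float, alpha: float, h: float) -> float:
    """Probability that the diffusion started in x hits 1 before 0."""
    return scale_function(x, alpha, h) / scale_function(1.0, alpha, h)


alpha = 20.0
p_codominant = fixation_probability(0.1, alpha, 0.5)
# Closed form for h = 1/2: (1 - exp(-alpha x)) / (1 - exp(-alpha)).
p_exact = (1.0 - math.exp(-alpha * 0.1)) / (1.0 - math.exp(-alpha))

p_dominant = fixation_probability(0.01, 100.0, 1.0)
p_recessive = fixation_probability(0.01, 100.0, 0.0)
print(p_codominant, p_exact, p_dominant, p_recessive)
```

Starting from the same small frequency, the recessive allele fixes with a much smaller probability than the dominant one, because selection is ineffective while the allele is rare.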



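The Green function for $X^*$, the diffusion which is conditioned to hit 1 and
started in $x$, is given by

$$G^*(x,\xi) = \begin{cases} \dfrac{2\,(S(1) - S(\xi))\,S(\xi)}{S(1)\,\sigma^2(\xi)\,S'(\xi)}, & x \le \xi, \\[2ex] \dfrac{2\,(S(1) - S(x))\,S(\xi)}{S(1)\,\sigma^2(\xi)\,S'(\xi)} \cdot \dfrac{S(\xi)}{S(x)}, & x > \xi, \end{cases}$$

$$= \begin{cases} \dfrac{2 \int_\xi^1 e^{-2h\alpha\eta - \alpha\eta^2(1-2h)}\,d\eta \int_0^\xi e^{-2h\alpha\eta - \alpha\eta^2(1-2h)}\,d\eta}{\xi(1 - \xi)\,e^{-2h\alpha\xi - \alpha\xi^2(1-2h)} \int_0^1 e^{-2h\alpha\eta - \alpha\eta^2(1-2h)}\,d\eta}, & x \le \xi, \\[3ex] \dfrac{2 \int_x^1 e^{-2h\alpha\eta - \alpha\eta^2(1-2h)}\,d\eta \,\Big( \int_0^\xi e^{-2h\alpha\eta - \alpha\eta^2(1-2h)}\,d\eta \Big)^2}{\xi(1 - \xi)\,e^{-2h\alpha\xi - \alpha\xi^2(1-2h)} \int_0^1 e^{-2h\alpha\eta - \alpha\eta^2(1-2h)}\,d\eta \int_0^x e^{-2h\alpha\eta - \alpha\eta^2(1-2h)}\,d\eta}, & x > \xi. \end{cases} \qquad (6.3)$$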



Here, $G^*(x,\xi)\,d\xi$ is the average time the conditioned diffusion spends in
$(\xi, \xi + d\xi)$ before hitting 1. In particular, we will use that

$$E_0[T^*] = \int_0^1 G^*(0,\xi)\,d\xi, \qquad V_0[T^*] = 2 \int_0^1 \int_0^\xi G^*(0,\xi)\,G^*(\xi,\eta)\,d\eta\,d\xi. \qquad (6.4)$$

While the first identity is a classical result in diffusion theory, the second
identity can e.g. be read off from Dawson et al. (2001, equation 2.1.1).



6.1 Proofs from Section 3

In this Section, we fix $h = 0$. We will start with our key Lemma 6.3, which
states that a scaling of the diffusion $X$ immediately gives our result that the
sweep length is of order $1/\sqrt{\alpha}$. Afterwards we provide proofs for
Proposition 3.1 and Theorems 1 and 2.



A key lemma 



In order to study the diffusion (3.1) and the structured coalescent, we use a
rescaling argument. We need a definition first.

Definition 6.1 The diffusion $\mathcal{Y} = (Y_\tau)_{\tau \ge 0}$, which takes values in
$[0, \infty]$, is the unique solution of the SDE

$$dY = Y^2\,d\tau + \sqrt{Y}\,dW,$$

for which $Y_\tau = \infty$ is a trap. We set

$$T_\mathcal{Y} := \inf\{\tau \ge 0 : Y_\tau = \infty\}.$$

The process $\mathcal{Y}^* = (Y^*_\tau)_{\tau \ge 0}$ is given as $\mathcal{Y}$, conditioned on the
event $\{T_\mathcal{Y} < \infty\}$. We write $T^*_\mathcal{Y}$ for $T_\mathcal{Y}$, also
conditioned on $\{T_\mathcal{Y} < \infty\}$.
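Remark 6.2 (The diffusion $\mathcal{Y}^*$) The diffusion $\mathcal{Y}^*$ will play a crucial role
in our proofs. Note that the scale function of $\mathcal{Y}$ is

$$S_\mathcal{Y}(x) = \int_0^x e^{-\eta^2}\,d\eta.$$

For the Green function of $\mathcal{Y}^*$,

$$G_{\mathcal{Y}^*}(x,\xi) = \begin{cases} \dfrac{2 \int_\xi^\infty e^{-\eta^2}\,d\eta \int_0^\xi e^{-\eta^2}\,d\eta}{\xi\,e^{-\xi^2} \int_0^\infty e^{-\eta^2}\,d\eta}, & x \le \xi, \\[3ex] \dfrac{2 \int_x^\infty e^{-\eta^2}\,d\eta \,\Big( \int_0^\xi e^{-\eta^2}\,d\eta \Big)^2}{\xi\,e^{-\xi^2} \int_0^\infty e^{-\eta^2}\,d\eta \int_0^x e^{-\eta^2}\,d\eta}, & x > \xi. \end{cases} \qquad (6.5)$$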



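Lemma 6.3 (Key lemma) Let $T^*, X^*$ be as in Definition 2.1 and $T_\mathcal{Y}, T^*_\mathcal{Y}, \mathcal{Y}^*$
be as in Definition 6.1 such that $X^*$ and $\mathcal{Y}^*$ are started in 0. Then, for
$\tilde X^*_\tau := \sqrt{\alpha}\,X^*_{\tau/\sqrt{\alpha}}$,

$$\tilde X^* \Rightarrow \mathcal{Y}^* \quad \text{as } \alpha \to \infty. \qquad (6.6)$$

Moreover,

$$E_0[T^*] - \frac{1}{\sqrt{\alpha}}\,E_0[T^*_\mathcal{Y}] \approx \frac{3 \log \alpha}{2\alpha}, \qquad (6.7)$$

$$E_0[T^*_\mathcal{Y}] = \frac{4}{\sqrt{\pi}}\,c_{cat}, \qquad (6.8)$$

where $c_{cat} \approx 0.916$ is Catalan's constant. In particular,

$$\sqrt{\alpha} \cdot T^* \Rightarrow T^*_\mathcal{Y} \quad \text{as } \alpha \to \infty. \qquad (6.9)$$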



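Proof For (6.6), we start by a change of variables

$$\tilde X_\tau = \sqrt{\alpha}\,X_{\tau/\sqrt{\alpha}}.$$

By changing the time scale to $d\tau = \sqrt{\alpha}\,dt$, we obtain by Itô's formula that

$$d\tilde X = \sqrt{\alpha}\,dX = \sqrt{\alpha}\,\tilde X^2 (1 - \tilde X/\sqrt{\alpha})\,dt + \sqrt{\sqrt{\alpha}\,\tilde X (1 - \tilde X/\sqrt{\alpha})}\,dW$$

$$= \tilde X^2 (1 - \tilde X/\sqrt{\alpha})\,d\tau + \sqrt{\tilde X (1 - \tilde X/\sqrt{\alpha})}\,dW.$$

Since $\alpha \to \infty$, we see that

$$\tilde X \Rightarrow \mathcal{Y} \quad \text{as } \alpha \to \infty,$$

which also implies (6.6).

For (6.8), with a little help from Mathematica to evaluate the last integral,

$$E_0[T^*_\mathcal{Y}] = \int_0^\infty G_{\mathcal{Y}^*}(0,\xi)\,d\xi = \frac{4}{\sqrt{\pi}} \int_0^\infty \frac{\int_0^\xi e^{-\eta^2}\,d\eta \int_\xi^\infty e^{-\eta^2}\,d\eta}{\xi\,e^{-\xi^2}}\,d\xi = \frac{4}{\sqrt{\pi}}\,c_{cat}. \qquad (6.10)$$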
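The value of the integral can also be checked numerically without computer algebra. Writing the two inner integrals through the error function, the Green function of $\mathcal{Y}^*$ started in 0 becomes $\sqrt{\pi}\,\mathrm{erf}(\xi)\,\mathrm{erfc}(\xi)\,e^{\xi^2}/\xi$. The sketch below integrates this expression with a midpoint rule; the cut-off, the step count and the tail correction $1/\text{upper}$ (from the large-$\xi$ behaviour $1/\xi^2$ of the integrand) are assumptions chosen for illustration.

```python
import math

CATALAN = 0.915965594177219  # Catalan's constant


def green_from_zero(xi: float) -> float:
    """Green function of the conditioned rescaled diffusion started in 0:
    sqrt(pi) * erf(xi) * erfc(xi) * exp(xi^2) / xi."""
    return math.sqrt(math.pi) * math.erf(xi) * math.erfc(xi) * math.exp(xi * xi) / xi


def expected_rescaled_sweep_time(upper: float = 20.0, steps: int = 200000) -> float:
    """Midpoint-rule integral of the Green function over (0, upper),
    plus the approximate tail 1/upper of the remaining integral."""
    dx = upper / steps
    body = dx * sum(green_from_zero((k + 0.5) * dx) for k in range(steps))
    return body + 1.0 / upper


numeric = expected_rescaled_sweep_time()
closed_form = 4.0 * CATALAN / math.sqrt(math.pi)
print(numeric, closed_form)
```

Both numbers are close to 2.067, so on the original time scale the leading term of the expected sweep length is about $2.067/\sqrt{\alpha}$.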



For (6.7), recall the Green function for the diffusion X* from (6.3). We use 



2^ 



e''' drj 



'•n rir, — 







1 



1 1 



^(1-^/ys) c vs-c 



and (6.10) to obtain 



Jo V"'" Jo 



^ j/"e-'^'d7jj^e^'^'dr, 



with (recall the sequence c^ going to infinity slowly enough) 



de 



^1 



J^ e-^ df] J^ e-''' dr^ 

poo „„„2 , c^ , 



V^ 



de«-^ 



1 






1 






1 

"2,/S 









As:= 



Va/2 ^ 

(V^-C)e-«^ 

AT ' 



^^ J^/2 e 






d^ 






d^ 
8a3/2 ' 






Zi4 



dC: 



//^e- (''-«)(''+«) d77 



^log(VS) = ^loga, 
V^ J^ e-^i" dr, J^ e-^" dij 



/a-C 



dC 



^-e ""^^"47^ 



7^ 



1 - e-i^^ 



e 



-dC 



v^ 



/"Cc I _ g-2e 



%/S-e 



v^ 



-dc 



-'^^^V^log" 



The result (6.7) follows. 
Proof of Proposition 3.1

Due to Lemma 6.3 we can compute, by classical diffusion theory,






sy{oo) 



lo e-^'drj 



\foi ■ Ea ■ 



and we are done. 
Proof of Theorem 1

Clearly, (3.3) is a combination of (6.7) and (6.8). For (3.4) we use (6.4) and
write

, ., ... ...2/ ^ o x2 

aVn'''=°[T*l = 8a / /" ^^^ '-^^ '- d^di 



'0 



[T1 = 8a /' /' — 



oo ^i 

Jo 






— OL^ p—OiVp' 



drjdi < cx). (6.11) 



Proof of Theorem 2

We use the same notation as in Lemma 6.3 (in particular $\sqrt{\alpha}\,dt = d\tau$) and
transform the structured coalescent to a structured coalescent conditioned on
the process $\mathcal{Y}$. For (3.7), note that the coalescence rate at time $t$ is given by

$$\frac{1}{X_t}\,dt = \frac{\sqrt{\alpha}}{\tilde X_\tau} \cdot \frac{1}{\sqrt{\alpha}}\,d\tau \approx \frac{1}{Y_\tau}\,d\tau \quad \text{as } \alpha \to \infty.$$

Recombinations occur at rate

$$\rho(1 - X_t)\,dt = \frac{\rho}{\sqrt{\alpha}} \Big[ 1 - \frac{\tilde X_\tau}{\sqrt{\alpha}} \Big]\,d\tau \approx \lambda\,d\tau. \qquad (6.12)$$

For back recombination from the wild-type to the beneficial background,

$$\rho X_t\,dt = \frac{\rho}{\sqrt{\alpha}} \cdot \frac{\tilde X_\tau}{\sqrt{\alpha}}\,d\tau \approx 0.$$

All three limits are valid for all $0 \le \tau < T_\mathcal{Y}$.

So, for the structured coalescent conditioned on $\mathcal{Y}$, we find that the
coalescence rate is $(1/Y_\tau)\,d\tau$, and the recombination rate is $\lambda$.
During times when $Y_\tau$ is of order 1, we see that both coalescence and
recombination happen at rates of order 1. In particular, we have shown (3.7).
For (3.6), since back recombination can be ignored, by (6.12),



T,a,h=0 



[K.T' = (1, 0)|/Co = (1, 0)] = E^^''=°[e-^(^-^*)]. 
6.2 Proofs from Section 4

We directly use (6.2) in order to prove Proposition 4.1 and (6.4) in order
to prove Theorem 3. We will use several approximation results on integrals
appearing in these equations. We will use



/•a Q^ i^Oi i^o 



3-2«"d^ 



as well as 



ee-^'^dS, : 



4a3' 



(6.13) 



^-2hr,+ (2h-l),l'/a,^_ 



'^^^ ?7t:(1"S 



-2/1? \ 



2h 



\e 



2h-l /■« 



2h-n(j2h-l)-n''/a 



l)dri 



-2hn 2, C(l - e-2''?) + rfe-2''? 



(6.14) 



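for $\xi \le \alpha$ and fixed $0 < h < 1$, for some $0 < c, d < \infty$. The last equation
implies for $\xi = \alpha$

$$\int_0^\alpha e^{-2h\eta + (2h-1)\eta^2/\alpha}\,d\eta \approx \frac{1}{2h} + \frac{2h - 1}{4h^3\alpha}. \qquad (6.15)$$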
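The large-$\alpha$ approximation $\int_0^\alpha e^{-2h\eta + (2h-1)\eta^2/\alpha}\,d\eta \approx 1/(2h) + (2h-1)/(4h^3\alpha)$ follows from expanding $e^{(2h-1)\eta^2/\alpha}$ to first order and using $\int_0^\infty \eta^2 e^{-2h\eta}\,d\eta = 2/(2h)^3$. A minimal numerical sketch of this comparison is given below; the function names, the midpoint rule and the parameter values are assumptions chosen for illustration.

```python
import math


def integral_exact(alpha: float, h: float, steps: int = 200000) -> float:
    """Midpoint-rule value of the integral of
    exp(-2 h eta + (2h - 1) eta^2 / alpha) over [0, alpha]."""
    dx = alpha / steps
    total = 0.0
    for k in range(steps):
        eta = (k + 0.5) * dx
        total += math.exp(-2.0 * h * eta + (2.0 * h - 1.0) * eta * eta / alpha)
    return dx * total


def integral_approx(alpha: float, h: float) -> float:
    """First-order approximation 1/(2h) + (2h - 1)/(4 h^3 alpha)."""
    return 1.0 / (2.0 * h) + (2.0 * h - 1.0) / (4.0 * h ** 3 * alpha)


alpha = 1000.0
errors = {h: abs(integral_exact(alpha, h) - integral_approx(alpha, h))
          for h in (0.3, 0.5, 0.8)}
print(errors)
```

For $h = 1/2$ the correction term vanishes and the integral is $1 - e^{-\alpha}$, while for the other values the remaining error is of second order in $1/\alpha$.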



Proof of Proposition |4.1 



For < ft- < 1, we write, using (|6.2) and (6.15), 



[Ti <To] = ^ 



2has„. 



^-2hTi+{2h-l)ri^/c 



dr] 



Proof of Theorem [3] 

1. For < ft < 1, we write with (6.15) 

^-2h7j+(2h-l)7j^/a 






drj I e 



-2hrj+{2h~l)rj-'/a 



drj 



-d^ 



4ft 



a 



^(Q,__^)e-2'j«+(2/i-i)CV" / e~^'"'+(^''"^)'' /"dry 

Jo 

(Ai + A2) 



(6.16) 
with



^-2hri+{2h-l)ri'/a^^ / ^-2h-n+{2h-l)-n' /a 



J^ 



^^^2hi+{2h~l)e/a 

^-2hr,+{2h-l)r,^/a^^ f ^~2hr,+{2h~lW /a ^^ 

We use 

'■^ 1 - e^'' 

d?7 — log ^ ~ 7 

V 

for .^ ^- oo and, if < ^ < a, 

j^ e-2hv+i2h-l)vVo^dr] 1 _ g-(2h-(2h-l)2C/a)(a-e) 

e-2/i«+(2h-i)«2/" 2/t - 2(2/;, - l)^/a 

g-2/ir,+ (2ft-l)(2r,C+rj^)/a^^ _ f ^- {2h- (2h-l)2i / a.)r) ^^ 

/O "'0 

^-(2h-2(2h-l)i/a.)7U^C2h-l)7f/a _ ^\^^ 
/O 
_ 2fe- 1 /■" ^-(2h-2{2h-l)aa)v^2^^ 

(^ Jo 

^(^l _ g-(2h-2(2h-l)e/a)(a-C)-) ^ ^p-(2/i-2(2h-l)e/a)(a-e) 

a 

for some c, d > 0, not depending on a, in order to obtain 

4/iAi--(log(2/ia)+7) 

»a Kg-2hr,+ (2/i-l)r,Va^^ f'^ g-'2hrt+{2h-l)n^ /a ^^ 

1 r^''" 1 - e-« ,^ 



(6.17) 



/lio ^ ^2/i-2(2/i-l)e/a^ 

1 ?=o 

log (2/i - 2(2h - l)i/a) 



h 



?=a 



-^(log(2/j)-log(2(l-/i))). 
Here, we have used (6.14) and (6.17) for the second $\approx$. In addition,
1 



AhAy 



1-h 
Ah 



(log(2(l-/i)a)+7) 

{a - ^)e-2/>«+(2h-i)4V" 



d^ 



l-h 







e 



l-h\Jo 



1 _ g-(2/i-(2/i-l)25/a)(a-5) 2(1 - h) 



i^a-i 1 



l-/i 



1 _ g-2(l-/i)C 
1 _ g-2(l-/i)C 



2h-2{2h- l)C/a 



d^ 



d^ 



2{l-h) 



^ 2(1 - /i) - 2(2(1 - /i) - l)^/a 

1 _ g-2(l-/i)5 



d^ 



The last line is exactly line (6.18) with $h$ replaced by $1 - h$, and hence we
obtain



Eg-'n- „y"L ^ l(l(7 + 21og/^-log(l-fe)) 



a/i(l — /i) 



+ y^(7 + 21og(l-/i)-log/i)) 

^ ^7+(2-3/i)log/i+(3/i-l)log(l-/i) 



Q;/i(l — /i) 



The variance is computed by (6.4) and 
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Jo ^(1 -^)77(1 - 77)e-2'ia« + (2ft-l)aC2g_2/ia»7+(2/i-l)a»,2 
(/''e-2'-C+(2/.-l)aC^rf^y 

(/o^e-2''"C+(2ft-i)oC^dC) 
X ^(a - C)r?(a - 7/)e-2'«?+(2ft-i)?V"e-2'»')+(2/i-i)»,V" 
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dc 
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c^ 






drjd^ 



( 



{l-hf 



for some finite c', c". 

2. Now, we come to the case $h = 1$. Here, we use the scale function (which is
up to a factor of $e^{-\alpha}$ the same as in (6.1) for $h = 1$)

$$S(x) = \int_0^x \exp\Big( -2\alpha \int_1^\xi (1 - \eta)\,d\eta \Big)\,d\xi = \int_0^x e^{\alpha(1 - \xi)^2}\,d\xi.$$



Then, 



E"'''=ifT*]=2 / — 
^ f(l 



J^ e°"'^ dr] C e°"'^ df] 



2 ri _^n2 . 



df 



/« A e(l - e/V^)e«' /o^ e"'d77 
2 



-d^ 



'■v^„»7^ 



rv^ „»)^ , 



such that (recaU ( |6.13 l), and using J/" e^ drj — /q e'' drj — J^ e'' d?7 and 
Mathematica for the first term 



A 



J^e^'dv^^_^y^ 



Ce? 



4 ' 
For the variance,



^l-e-2? log a 
at w — ^. 



aVo'''='[T*] =8a 



= 8a 



1 /•« 



r^'^-^^'^C) (A%e"C^dC 



(/o^e"C^rfC) e(l-0^(l-»y)e"(i-«)'e"(i- 



-drjdS, 



■vY 



a f-Ja 



8 / / ^ , ,. „/ de^T? < ex.. 



-dr]dS, 



(6.19) 



Proof of Theorem 4

The proof uses an approach similar to the proof of Proposition 3.4 in Etheridge
et al. (2006). Recall the structured coalescent from Definition 2.3. We need to
show the following assertions in the limit of large $\alpha$:

(i) any line never undergoes a transition 4. (back recombination to the
beneficial background).

(ii) any pair of lines never undergoes a transition 2. (coalescence in the
wild-type background).

(iii) any pair of lines never makes a transition 1. (coalescence in the beneficial
background) such that the coalesced line then makes a transition 3.
(recombination to the wild-type).

(iv) the probability for each line to stay in the beneficial background (i.e. no
transition 3., recombination to the wild-type background) until the origin
of the beneficial allele is $e^{-\lambda/h}$.

(v) if two lines do not coalesce (no transition 1.), recombination to the
wild-type background (transition 3.) occurs for both lines independently.

Let us explain how these five assertions imply Theorem 4. Consider a single
line, i.e. $\mathcal{K}_0 = (1, 0)$. When (i),...,(v) are shown, (i) and (iv) immediately
imply (4.5). Next, consider $\mathcal{K}_0 = (2, 0)$. For (4.6), the event that the two
components of $\mathcal{K}_{T^*}$ sum to 1 means that an event 1. has occurred (since 2.
does not occur by (ii)). In addition, the lines coalesce at $\beta = T^*$, the
beginning of the sweep. Otherwise there would be the chance that the coalesced
line recombines through a transition 3., which is not possible by (iii). In
particular, the last two arguments show that $\mathcal{K}_{T^*} = (1, 0)$. Hence,
$\mathcal{K}_{T^*} = (1, 0)$ iff both lines do not make a transition 3. if coalescence
cannot occur. This probability is $e^{-2\lambda/h}$ by (v).



The main point in showing the star-like approximation is (iii), as this
assertion exactly implies that the structured coalescent converges to a star-like
tree. So we start with proving this. Since transitions 1. occur at rate $1/X_t$ and
transitions 3. at rate $\rho(1 - X_t)$, and using the Green function identity from
Dawson et al. (2001, equation 2.1.1) in the third line,



'[event described in (iii) occurs] 



= E^ 



f f p{l-Xs)exp(^- f p{l + lr>t)il - Xr)dr'^ 
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log a 



(6.20) 



where Bi and B2 are given below. For J5i, we have for some finite c, using 



(6.3) 



Bi^] f f G*{0,OG*{C,vKl-0^dr,d^ 
4 Jo Jo V 
For $B_2$, we compute, with some finite $c'$,

(6.21)

(6.22)



where each integrand in $\int \cdots d\xi$ is the same as the first integrand in the
respective line. Combining (6.20), (6.21) and (6.22) shows that



:^0 



P [event described in (iii) occurs] 
which proves (iii). Similar calculations show that 



P[event described in (i) occurs] 

Jo Jo 
P[event described in (ii) occurs] 

<pffG*{Q,OG* it ^/) T^ (1 - V)drid£, ^^^^ 0, 

Jo Jo 4 — ^ 
and (i), (ii) follow. For (iv) and (v), note that the probability that a line does
not recombine at all is

$$P[\text{event described in (iv) occurs}] = E\Big[ \exp\Big( -\rho \int_0^{T^*} (1 - X_s)\,ds \Big) \Big]$$

and (v) is equivalent to

$$P[\text{neither of two lines recombines}] = E\Big[ \exp\Big( -2\rho \int_0^{T^*} (1 - X_s)\,ds \Big) \Big] \xrightarrow{\alpha \to \infty} e^{-2\lambda/h}.$$

Hence, (iv) and (v) are proved once we show that

$$E\Big[ \exp\Big( -\rho \int_0^{T^*} (1 - X_s)\,ds \Big) \Big] \xrightarrow{\alpha \to \infty} e^{-\lambda/h}. \qquad (6.23)$$

In order to see this, we show that

$$\rho \int_0^{T^*} (1 - X_s)\,ds \xrightarrow{\alpha \to \infty} \frac{\lambda}{h}$$

in $L^2$. Since $L^2$-convergence implies convergence in distribution, and
$x \mapsto e^{-x}$ is bounded on $\mathbb{R}_+$ and continuous, (6.23) then follows. We compute
pI il-X,)ds\=p f G*{0,O{l-Od^ 
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(6.24) 



with Al from (6.16), where we have used (6.18). Next, 

rT' 



p- I (1 - Xs)ds 





/ G*iO,OG*i^,v)(.l-Oa-v)drid^ 

2Ttrrm*l Q— i-OO 



< 8p'Y[T*] ^^^^ 



(6.25) 



by Theorem 3. Combining (6.24) and (6.25), $L^2$ convergence follows and we
have proved (6.23) as well as Theorem 4.